Banks and credit card companies calculate a credit score to determine a customer's creditworthiness. It helps them decide quickly whether to issue loans to customers with good creditworthiness. Today, banks and credit card companies use machine learning algorithms to classify the customers in their database based on their credit history.

Credit Score Classification¶

There are three credit score labels that banks and credit card companies use to classify their customers:

Good
Standard
Poor

A person with a good credit score will get loans from any bank and financial institution. For the task of Credit Score Classification, we need a labelled dataset with credit scores.

I found an ideal dataset for this task on Kaggle, labelled according to the credit history of credit card customers.

Data Source: Kaggle -> https://www.kaggle.com/parisrohan

Rationale for the Project¶

The project aims to analyze a dataset related to credit scores and financial behavior to gain insights into the factors that influence credit scores and financial health. The research questions to be analyzed include:

  • What are the key factors that impact credit scores?
  • How does financial behavior and credit utilization affect credit scores?
  • Can we predict credit scores based on financial and demographic variables?

Description of the Dataset¶

The dataset contains information about individuals' financial and credit-related attributes, such as annual income, monthly salary, number of bank accounts, credit card usage, loan details, credit history, and credit scores. The data has been collected through financial institutions and credit agencies, and it provides insights into the financial behavior and creditworthiness of individuals.

Key Characteristics¶

  • The dataset includes both numerical and categorical variables.
  • It provides a comprehensive view of individuals' financial profiles and credit scores.
  • The dataset allows for the exploration of relationships between financial attributes and credit scores.

Limitations¶


  • The dataset may contain missing values and outliers that need to be addressed.
  • The data may have been collected from a specific region or demographic, limiting its generalizability.
  • The accuracy of the credit scores and financial information may vary based on the source of the data.

I start this task of credit score classification by importing the necessary Python libraries and the dataset.

  1. Pandas: An open-source software library built on top of Python specifically for data manipulation and analysis, Pandas offers data structures and operations for powerful, flexible, and easy-to-use data analysis and manipulation.

  2. Numpy: NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

  3. tqdm: The tqdm library creates a progress bar while iterating through an iterable (e.g., a loop), making it easy to see how much of the iteration has completed. The tqdm.auto module automatically selects the best implementation for the current environment, and tqdm.pandas() registers a progress_apply method on pandas objects. It is a useful tool for monitoring the execution of time-consuming operations.

  4. matplotlib.pyplot: matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

  5. seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

  6. plotly.express: Plotly Express is the easy-to-use, high-level interface to Plotly. It operates on a variety of data types and produces easy-to-style figures through functions such as px.bar and px.box.

  7. plotly.graph_objects: The plotly.graph_objects module (typically imported as go) contains an automatically generated hierarchy of Python classes that represent non-leaf nodes in the Plotly figure schema. The term "graph objects" refers to instances of these classes; the primary one is go.Figure.

  8. plotly.io: A low-level interface for displaying, reading, and writing figures, e.g., setting the default figure template or converting a figure to an HTML string.
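As a small illustration of the tqdm-pandas integration described in item 3 above, here is a toy sketch (the data and column name are illustrative stand-ins, not the project dataset):

```python
import pandas as pd
from tqdm.auto import tqdm

# Register the progress_apply method on pandas objects
tqdm.pandas(desc="Processing")

# Toy dataframe standing in for the real dataset
df = pd.DataFrame({"Annual_Income": [19114.12, 34847.84, 39628.99]})

# progress_apply works like apply, but renders a live progress bar
monthly = df["Annual_Income"].progress_apply(lambda x: x / 12)
print(monthly.round(2).tolist())
```

On a three-row frame the bar finishes instantly; on 100,000 rows it updates in real time as the loop progresses.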

In [19]:
#!pip install plotly
In [20]:
# Let's start by loading the data and taking a high-level view of its structure, contents, and statistics.

import pandas as pd
import numpy as np
from tqdm.auto import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"
In [21]:
# Loading the dataset

data = pd.read_csv('train.csv')

Viewing the dataframe¶

We can get a quick sense of the size of our dataset by using the shape attribute. This returns a tuple with the number of rows and columns in the dataset.

In [30]:
#Return number of rows and columns
data.shape
Out[30]:
(100000, 28)
In [22]:
# Displaying the head and tail of the dataframe, which show a few rows from the start and end of the dataframe

data.head(10)
Out[22]:
ID Customer_ID Month Name Age SSN Occupation Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts ... Credit_Mix Outstanding_Debt Credit_Utilization_Ratio Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month Amount_invested_monthly Payment_Behaviour Monthly_Balance Credit_Score
0 5634 3392 1 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 26.822620 265.0 No 49.574949 21.465380 High_spent_Small_value_payments 312.494089 Good
1 5635 3392 2 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 31.944960 266.0 No 49.574949 21.465380 Low_spent_Large_value_payments 284.629162 Good
2 5636 3392 3 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 28.609352 267.0 No 49.574949 21.465380 Low_spent_Medium_value_payments 331.209863 Good
3 5637 3392 4 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 31.377862 268.0 No 49.574949 21.465380 Low_spent_Small_value_payments 223.451310 Good
4 5638 3392 5 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 24.797347 269.0 No 49.574949 21.465380 High_spent_Medium_value_payments 341.489231 Good
5 5639 3392 6 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 27.262259 270.0 No 49.574949 21.465380 High_spent_Medium_value_payments 340.479212 Good
6 5640 3392 7 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 22.537593 271.0 No 49.574949 21.465380 Low_spent_Small_value_payments 244.565317 Good
7 5641 3392 8 Aaron Maashoh 23.0 821000265.0 Scientist 19114.12 1824.843333 3.0 ... Good 809.98 23.933795 272.0 No 49.574949 21.465380 High_spent_Medium_value_payments 358.124168 Standard
8 5646 8625 1 Rick Rothackerj 28.0 4075839.0 Teacher 34847.84 3037.986667 2.0 ... Good 605.03 24.464031 319.0 No 18.816215 39.684018 Low_spent_Small_value_payments 470.690627 Standard
9 5647 8625 2 Rick Rothackerj 28.0 4075839.0 Teacher 34847.84 3037.986667 2.0 ... Good 605.03 38.550848 320.0 No 18.816215 39.684018 High_spent_Large_value_payments 484.591214 Good

10 rows × 28 columns

In [23]:
data.tail(10)
Out[23]:
ID Customer_ID Month Name Age SSN Occupation Annual_Income Monthly_Inhand_Salary Num_Bank_Accounts ... Credit_Mix Outstanding_Debt Credit_Utilization_Ratio Credit_History_Age Payment_of_Min_Amount Total_EMI_per_month Amount_invested_monthly Payment_Behaviour Monthly_Balance Credit_Score
99990 155616 34304 7 Sarah McBridec 28.0 31350942.0 Architect 20002.88 1929.906667 10.0 ... Bad 3571.70 25.123535 74.0 Yes 60.964772 34.662906 Low_spent_Large_value_payments 228.750392 Standard
99991 155617 34304 8 Sarah McBridec 29.0 31350942.0 Architect 20002.88 1929.906667 10.0 ... Bad 3571.70 37.140784 75.0 Yes 60.964772 34.662906 High_spent_Large_value_payments 337.362988 Standard
99992 155622 37932 1 Nicks 24.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 32.991333 375.0 No 35.104023 24.028477 Low_spent_Small_value_payments 189.641080 Poor
99993 155623 37932 2 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 29.135447 376.0 No 35.104023 24.028477 Low_spent_Medium_value_payments 400.104466 Standard
99994 155624 37932 3 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 39.323569 377.0 No 35.104023 24.028477 High_spent_Medium_value_payments 410.256158 Poor
99995 155625 37932 4 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 34.663572 378.0 No 35.104023 24.028477 High_spent_Large_value_payments 479.866228 Poor
99996 155626 37932 5 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 40.565631 379.0 No 35.104023 24.028477 High_spent_Medium_value_payments 496.651610 Poor
99997 155627 37932 6 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 41.255522 380.0 No 35.104023 24.028477 High_spent_Large_value_payments 516.809083 Poor
99998 155628 37932 7 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 33.638208 381.0 No 35.104023 24.028477 Low_spent_Large_value_payments 319.164979 Standard
99999 155629 37932 8 Nicks 25.0 78735990.0 Mechanic 39628.99 3359.415833 4.0 ... Good 502.38 34.192463 382.0 No 35.104023 24.028477 High_spent_Medium_value_payments 393.673696 Poor

10 rows × 28 columns

In [25]:
# data.info(): This method prints information about the dataframe, including the index dtype, columns,
# non-null counts, and memory usage.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  int64  
 1   Customer_ID               100000 non-null  int64  
 2   Month                     100000 non-null  int64  
 3   Name                      100000 non-null  object 
 4   Age                       100000 non-null  float64
 5   SSN                       100000 non-null  float64
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  float64
 8   Monthly_Inhand_Salary     100000 non-null  float64
 9   Num_Bank_Accounts         100000 non-null  float64
 10  Num_Credit_Card           100000 non-null  float64
 11  Interest_Rate             100000 non-null  float64
 12  Num_of_Loan               100000 non-null  float64
 13  Type_of_Loan              100000 non-null  object 
 14  Delay_from_due_date       100000 non-null  float64
 15  Num_of_Delayed_Payment    100000 non-null  float64
 16  Changed_Credit_Limit      100000 non-null  float64
 17  Num_Credit_Inquiries      100000 non-null  float64
 18  Credit_Mix                100000 non-null  object 
 19  Outstanding_Debt          100000 non-null  float64
 20  Credit_Utilization_Ratio  100000 non-null  float64
 21  Credit_History_Age        100000 non-null  float64
 22  Payment_of_Min_Amount     100000 non-null  object 
 23  Total_EMI_per_month       100000 non-null  float64
 24  Amount_invested_monthly   100000 non-null  float64
 25  Payment_Behaviour         100000 non-null  object 
 26  Monthly_Balance           100000 non-null  float64
 27  Credit_Score              100000 non-null  object 
dtypes: float64(18), int64(3), object(7)
memory usage: 21.4+ MB
In [26]:
# Displaying summary statistics: the describe() function computes summary statistics for the numeric DataFrame columns,
# including count, mean, std, min, the quartiles, and max.

print(data.describe())
                  ID    Customer_ID          Month            Age  \
count  100000.000000  100000.000000  100000.000000  100000.000000   
mean    80631.500000   25982.666640       4.500000      33.316340   
std     43301.486619   14340.543051       2.291299      10.764812   
min      5634.000000    1006.000000       1.000000      14.000000   
25%     43132.750000   13664.500000       2.750000      24.000000   
50%     80631.500000   25777.000000       4.500000      33.000000   
75%    118130.250000   38385.000000       6.250000      42.000000   
max    155629.000000   50999.000000       8.000000      56.000000   

                SSN  Annual_Income  Monthly_Inhand_Salary  Num_Bank_Accounts  \
count  1.000000e+05  100000.000000          100000.000000      100000.000000   
mean   5.004617e+08   50505.123449            4197.270835           5.368820   
std    2.908267e+08   38299.422093            3186.432497           2.593314   
min    8.134900e+04    7005.930000             303.645417           0.000000   
25%    2.451686e+08   19342.972500            1626.594167           3.000000   
50%    5.006886e+08   36999.705000            3095.905000           5.000000   
75%    7.560027e+08   71683.470000            5957.715000           7.000000   
max    9.999934e+08  179987.280000           15204.633333          11.000000   

       Num_Credit_Card  Interest_Rate  ...  Delay_from_due_date  \
count    100000.000000   100000.00000  ...         100000.00000   
mean          5.533570       14.53208  ...             21.08141   
std           2.067098        8.74133  ...             14.80456   
min           0.000000        1.00000  ...              0.00000   
25%           4.000000        7.00000  ...             10.00000   
50%           5.000000       13.00000  ...             18.00000   
75%           7.000000       20.00000  ...             28.00000   
max          11.000000       34.00000  ...             62.00000   

       Num_of_Delayed_Payment  Changed_Credit_Limit  Num_Credit_Inquiries  \
count           100000.000000         100000.000000         100000.000000   
mean                13.313120             10.470323              5.798250   
std                  6.237166              6.609481              3.867826   
min                  0.000000              0.500000              0.000000   
25%                  9.000000              5.380000              3.000000   
50%                 14.000000              9.400000              5.000000   
75%                 18.000000             14.850000              8.000000   
max                 25.000000             29.980000             17.000000   

       Outstanding_Debt  Credit_Utilization_Ratio  Credit_History_Age  \
count     100000.000000             100000.000000       100000.000000   
mean        1426.220376                 32.285173          221.220460   
std         1155.129026                  5.116875           99.680716   
min            0.230000                 20.000000            1.000000   
25%          566.072500                 28.052567          144.000000   
50%         1166.155000                 32.305784          219.000000   
75%         1945.962500                 36.496663          302.000000   
max         4998.070000                 50.000000          404.000000   

       Total_EMI_per_month  Amount_invested_monthly  Monthly_Balance  
count        100000.000000            100000.000000    100000.000000  
mean            107.699208                55.101315       392.697586  
std             132.267056                39.006932       201.652719  
min               0.000000                 0.000000         0.007760  
25%              29.268886                27.959111       267.615983  
50%              66.462304                45.156550       333.865366  
75%             147.392573                71.295797       463.215683  
max            1779.103254               434.191089      1183.930696  

[8 rows x 21 columns]

data.dtypes returns the data type of each column. A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted, e.g., whether the data is an integer, float, or Python object.

In [28]:
data.dtypes
Out[28]:
ID                            int64
Customer_ID                   int64
Month                         int64
Name                         object
Age                         float64
SSN                         float64
Occupation                   object
Annual_Income               float64
Monthly_Inhand_Salary       float64
Num_Bank_Accounts           float64
Num_Credit_Card             float64
Interest_Rate               float64
Num_of_Loan                 float64
Type_of_Loan                 object
Delay_from_due_date         float64
Num_of_Delayed_Payment      float64
Changed_Credit_Limit        float64
Num_Credit_Inquiries        float64
Credit_Mix                   object
Outstanding_Debt            float64
Credit_Utilization_Ratio    float64
Credit_History_Age          float64
Payment_of_Min_Amount        object
Total_EMI_per_month         float64
Amount_invested_monthly     float64
Payment_Behaviour            object
Monthly_Balance             float64
Credit_Score                 object
dtype: object
In [29]:
# To count the number of null values in a Pandas DataFrame, we can use the isnull() method to create a Boolean mask 
# and then use the sum() method to count the number of True values.

print(data.isnull().sum())
ID                          0
Customer_ID                 0
Month                       0
Name                        0
Age                         0
SSN                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             0
Interest_Rate               0
Num_of_Loan                 0
Type_of_Loan                0
Delay_from_due_date         0
Num_of_Delayed_Payment      0
Changed_Credit_Limit        0
Num_Credit_Inquiries        0
Credit_Mix                  0
Outstanding_Debt            0
Credit_Utilization_Ratio    0
Credit_History_Age          0
Payment_of_Min_Amount       0
Total_EMI_per_month         0
Amount_invested_monthly     0
Payment_Behaviour           0
Monthly_Balance             0
Credit_Score                0
dtype: int64

The value_counts() function returns a Series containing counts of unique values, sorted in descending order so that the first element is the most frequently occurring value. By default, it excludes NA values.

In [31]:
data["Credit_Score"].value_counts()
Out[31]:
Standard    53174
Poor        28998
Good        17828
Name: Credit_Score, dtype: int64
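The counts above show a class imbalance: Standard dominates, and Good is the smallest class. A quick way to see relative frequencies is value_counts(normalize=True); a minimal sketch on a toy Series (not the actual data):

```python
import pandas as pd

# Toy stand-in for the Credit_Score column
scores = pd.Series(["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2)

# normalize=True returns proportions instead of raw counts
proportions = scores.value_counts(normalize=True)
print(proportions)
```

On the real column this immediately tells us that a majority-class baseline already achieves roughly 53% accuracy, a useful reference point for any model we train later.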
In [32]:
# Plotting distributions of a few key columns

# Selecting a few columns for distribution plots
columns_to_plot = ['Age', 'Annual_Income', 'Credit_Score']
In [33]:
# Plotting distributions
for column in columns_to_plot:
    plt.figure(figsize=(10, 5))
    sns.histplot(data[column].dropna(), kde=True)
    plt.title('Distribution of ' + column)
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()
  1. Visualize Different Aspects of the Dataset:
  • We will create visualizations for different features to understand their distributions, relationships, and potential impact on the target variable (Credit Score).
  • Advanced visualizations such as pair plots, heatmaps for correlation, and box plots for outlier detection will be used.
  2. Feature Analysis and Engineering:
  • We will analyze the features to understand their importance and relevance to the target variable.
  • Feature engineering will be performed here to create new features that could improve the model's performance.
  3. Machine Learning Tasks:
  • I will select appropriate machine learning models to predict the Credit Score or classify customers into different credit score categories.
  • We will train, validate, and test the models using the dataset.
  4. Analysis of Results:
  • We will evaluate the models using appropriate metrics and comment on the performance.
  • Visualizations will be used to interpret the results and understand the model's behavior.
In [35]:
# I will start with a heatmap to understand the correlations between different features. Then, 
#I will create box plots to detect outliers in key features.


# Calculating correlations between numeric columns
correlation_matrix = data.corr(numeric_only=True)

# Plotting the correlation matrix to identify key factors impacting credit scores
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

The correlation matrix provides insights into the relationships between different variables and credit scores. This visualization helps identify the key factors that impact credit scores.
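To turn the heatmap into a ranked list of candidate factors, one common sketch is to ordinal-encode the target and sort features by absolute correlation with it. This is a toy example with illustrative column names and made-up values, and the Poor/Standard/Good ordinal mapping is an assumption made for ranking only:

```python
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "Outstanding_Debt": [800, 3500, 500, 2900, 700],
    "Annual_Income": [19000, 20000, 39000, 21000, 41000],
    "Credit_Score": ["Good", "Standard", "Poor", "Standard", "Good"],
})

# Map labels to an ordinal scale (an assumption for this ranking only)
df["score_num"] = df["Credit_Score"].map({"Poor": 0, "Standard": 1, "Good": 2})

# Rank features by absolute correlation with the encoded target
ranked = (
    df.corr(numeric_only=True)["score_num"]
      .drop("score_num")
      .abs()
      .sort_values(ascending=False)
)
print(ranked)
```

Note that linear correlation against an ordinal encoding is only a rough screen; it can miss non-monotonic effects, which is one reason tree-based models are explored below.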

Data Exploration¶

The dataset has many features that can train a Machine Learning model for credit score classification. Let’s explore all the features one by one.

I will start by exploring the occupation feature to know if the occupation of the person affects credit scores:

In [36]:
fig = px.box(data, 
             x="Occupation",  
             color="Credit_Score", 
             title="Credit Scores Based on Occupation", 
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.show()

There’s not much difference in the credit scores across the occupations mentioned in the data. Now I will explore whether a person's annual income impacts their credit score:

In [37]:
fig = px.box(data, 
             x="Credit_Score", 
             y="Annual_Income", 
             color="Credit_Score",
             title="Credit Scores Based on Annual Income", 
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

According to the above visualization, the more you earn annually, the better your credit score is. Next, let's explore whether the monthly in-hand salary impacts credit scores:

In [38]:
fig = px.box(data, 
             x="Credit_Score", 
             y="Monthly_Inhand_Salary", 
             color="Credit_Score",
             title="Credit Scores Based on Monthly Inhand Salary", 
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

Like annual income, the more monthly in-hand salary you earn, the better your credit score will be. Now, let's check whether having more bank accounts impacts credit scores:

In [39]:
fig = px.box(data, 
             x="Credit_Score", 
             y="Num_Bank_Accounts", 
             color="Credit_Score",
             title="Credit Scores Based on Number of Bank Accounts", 
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

Maintaining more than five accounts is not good for a good credit score; a person should have only 2 – 3 bank accounts. So having more bank accounts doesn’t positively impact credit scores. Now I will check the impact of the number of credit cards on credit scores:

In [40]:
fig = px.box(data, 
             x="Credit_Score", 
             y="Num_Credit_Card", 
             color="Credit_Score",
             title="Credit Scores Based on Number of Credit cards", 
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

Just like the number of bank accounts, having more credit cards will not positively impact your credit score; having 3 – 5 credit cards is good for your credit score. Next, let's check the impact on credit scores of the average interest paid on loans and EMIs:

In [41]:
fig = px.box(data, 
             x="Credit_Score", 
             y="Interest_Rate", 
             color="Credit_Score",
             title="Credit Scores Based on the Average Interest rates", 
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

If the average interest rate is 4 – 11%, the credit score is good. An average interest rate of more than 15% is bad for your credit score. Now let’s see how many loans one can take at a time while keeping a good credit score:

In [42]:
fig = px.box(data, 
             x="Credit_Score", 
             y="Num_of_Loan", 
             color="Credit_Score", 
             title="Credit Scores Based on Number of Loans Taken by the Person",
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

To have a good credit score, you should not take more than 1 – 3 loans at a time. Having more than three loans at a time will negatively impact your credit scores.

In [43]:
# Checking whether monthly investments affect credit scores:

fig = px.box(data, 
             x="Credit_Score", 
             y="Amount_invested_monthly", 
             color="Credit_Score", 
             title="Credit Scores Based on Amount Invested Monthly",
             color_discrete_map={'Poor':'red',
                                 'Standard':'yellow',
                                 'Good':'green'})
fig.update_traces(quartilemethod="exclusive")
fig.show()

The amount of money you invest monthly doesn’t affect your credit scores a lot.

In [49]:
# Feature Analysis and Engineering

# Let's start by checking for missing values in the dataset.
missing_values = data.isnull().sum()
print('Missing values in each column:\n', missing_values)
Missing values in each column:
 ID                          0
Customer_ID                 0
Month                       0
Name                        0
Age                         0
SSN                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             0
Interest_Rate               0
Num_of_Loan                 0
Type_of_Loan                0
Delay_from_due_date         0
Num_of_Delayed_Payment      0
Changed_Credit_Limit        0
Num_Credit_Inquiries        0
Credit_Mix                  0
Outstanding_Debt            0
Credit_Utilization_Ratio    0
Credit_History_Age          0
Payment_of_Min_Amount       0
Total_EMI_per_month         0
Amount_invested_monthly     0
Payment_Behaviour           0
Monthly_Balance             0
Credit_Score                0
Debt_Income_Ratio           0
dtype: int64
In [50]:
# We will also check the unique values for categorical columns to decide on encoding strategies.
categorical_columns = data.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f'Unique values in {col}:', data[col].nunique())
Unique values in Name: 10128
Unique values in Occupation: 15
Unique values in Type_of_Loan: 6261
Unique values in Credit_Mix: 3
Unique values in Payment_of_Min_Amount: 3
Unique values in Payment_Behaviour: 6
Unique values in Credit_Score: 3
In [51]:
# For feature engineering, let's create a new feature as an example:
# the ratio of outstanding debt to annual income.
data['Debt_Income_Ratio'] = data['Outstanding_Debt'] / data['Annual_Income']

# Displaying the head of the dataframe to show the new feature.
print(data[['Outstanding_Debt', 'Annual_Income', 'Debt_Income_Ratio']].head())
   Outstanding_Debt  Annual_Income  Debt_Income_Ratio
0            809.98       19114.12           0.042376
1            809.98       19114.12           0.042376
2            809.98       19114.12           0.042376
3            809.98       19114.12           0.042376
4            809.98       19114.12           0.042376
  • We have analyzed the dataset for missing values and unique counts in the categorical columns. Here are the findings:
  1. There are no missing values in any column.
  2. The 'Name' column has 10,128 unique values.
  3. The 'Occupation' column has 15 unique values.
  4. The 'Type_of_Loan' column has 6,261 unique values, indicating a high variety of loan-type combinations.
  5. The 'Credit_Mix' column has 3 unique values.
  6. The 'Payment_of_Min_Amount' column has 3 unique values.
  7. The 'Payment_Behaviour' column has 6 unique values.
  • Additionally, we have created a new feature called 'Debt_Income_Ratio', which represents the ratio of outstanding debt to annual income.

  • As a defensive step, we will fill any missing values in the source columns with their medians and recompute the feature.
In [52]:
# Handling missing values defensively and recomputing the new feature

# Fill any missing values in 'Outstanding_Debt' and 'Annual_Income' with the median of the column
# as an example of handling missing values.
data['Outstanding_Debt'] = data['Outstanding_Debt'].fillna(data['Outstanding_Debt'].median())
data['Annual_Income'] = data['Annual_Income'].fillna(data['Annual_Income'].median())

# Now let's recreate the 'Debt_Income_Ratio' feature
data['Debt_Income_Ratio'] = data['Outstanding_Debt'] / data['Annual_Income']

# Display the head of the dataframe to show the new feature along with 'Outstanding_Debt' and 'Annual_Income'.
print(data[['Outstanding_Debt', 'Annual_Income', 'Debt_Income_Ratio']].head())
   Outstanding_Debt  Annual_Income  Debt_Income_Ratio
0            809.98       19114.12           0.042376
1            809.98       19114.12           0.042376
2            809.98       19114.12           0.042376
3            809.98       19114.12           0.042376
4            809.98       19114.12           0.042376

The 'Debt_Income_Ratio' feature has been successfully created and added to the dataset. This feature represents the ratio of outstanding debt to annual income, which could be a useful indicator for financial health and creditworthiness.
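One caveat with ratio features: a zero or missing denominator produces inf or NaN, which many models cannot handle. A small defensive sketch, using toy values (the 0 and NaN incomes are deliberate edge cases, not from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Outstanding_Debt": [809.98, 605.03, 500.00],
    "Annual_Income": [19114.12, 0.0, np.nan],  # edge cases on purpose
})

ratio = df["Outstanding_Debt"] / df["Annual_Income"]

# Division by zero yields inf; convert it to NaN first,
# then impute all NaNs with the median of the valid ratios
ratio = ratio.replace([np.inf, -np.inf], np.nan)
ratio = ratio.fillna(ratio.median())
print(ratio.tolist())
```

Note the two-step order: replacing inf before taking the median matters, because a median computed over values that still contain inf would itself be inf.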

In [53]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder


# Encode the target variable
label_encoder = LabelEncoder()
data['Credit_Score_encoded'] = label_encoder.fit_transform(data['Credit_Score'])

# Select only numeric columns for features
numeric_columns = data.select_dtypes(include=['number']).columns
X = data[numeric_columns].drop(columns=['Credit_Score_encoded'])
y = data['Credit_Score_encoded']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output the non-numeric columns and the accuracy
print('Non-numeric columns:', data.select_dtypes(exclude=['number']).columns.tolist())
print('Accuracy of the RandomForestClassifier:', accuracy)
Non-numeric columns: ['Name', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour', 'Credit_Score']
Accuracy of the RandomForestClassifier: 0.83815
  1. Encoding the target variable 'Credit_Score' using LabelEncoder to convert it into a numeric format for modeling.
  2. Selecting only the numeric columns as features and dropping the 'Credit_Score_encoded' column.
  3. Splitting the data into training and testing sets using an 80-20 split ratio.
  4. Initializing and training a Random Forest classifier on the training data.
  5. Making predictions on the testing data and calculating the accuracy of the model from the predicted and actual values.

  • The overall goal of this code is to train a Random Forest classifier to predict the encoded credit scores from the selected numeric features and to evaluate its accuracy.


The non-numeric columns identified in the dataset are 'Name', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour', and 'Credit_Score'. These were excluded from the feature set used to train the RandomForestClassifier.


After training the model on the numeric columns and making predictions on the test set, the accuracy of the RandomForestClassifier is approximately 83.8%. Given that a majority-class baseline would score only about 53%, this is a credible evaluation of the model's predictive performance.
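Accuracy alone hides per-class behavior on imbalanced data, and the fitted forest's feature_importances_ speak directly to the first research question (which factors matter most). A self-contained sketch on synthetic data, where income drives the label by construction and a noise column does not:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-in: the label is determined entirely by income bands
income = rng.uniform(7_000, 180_000, n)
noise = rng.normal(size=n)
y = np.digitize(income, [40_000, 100_000])  # 0=Poor, 1=Standard, 2=Good
X = pd.DataFrame({"Annual_Income": income, "Noise": noise})

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Per-class precision/recall/F1, not just overall accuracy
print(classification_report(y_te, clf.predict(X_te)))

# Importances sum to 1; the informative feature should dominate
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines (classification_report and feature_importances_) on the real rf_classifier would show how the model treats the minority Good class and which financial attributes it leans on.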

Conclusion:¶

The conclusion is that I have successfully trained a Random Forest classifier to predict credit scores from the selected numeric features. The model's accuracy has been evaluated, and it can be used to make predictions on new data.

Overall, the dataset has been processed, the model has been trained, and its performance has been assessed. This concludes the project, and the trained model can now be used for credit score prediction.
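To actually reuse the trained classifier outside this notebook, a common approach is to persist it with joblib. A sketch under stated assumptions: the filename is illustrative, and a small stand-in model is trained here so the block is self-contained (in the notebook it would be rf_classifier):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; in the notebook this would be rf_classifier
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, "credit_score_rf.joblib")      # save to disk
restored = joblib.load("credit_score_rf.joblib")  # load it back

# The restored model predicts identically to the original
print((restored.predict(X) == model.predict(X)).all())
```

One design note: joblib serializes the fitted estimator together with its learned trees, so the loading environment should use a compatible scikit-learn version.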

In [ ]: